by Tyler Julian Last Updated: 3/29/2018
========================================================
In finance, everyone is looking to get the best deal. Best interest rate, best return, low risk and high reward… as both borrowers and lenders, we want it all. This tension primarily plays out in the interest rate of a loan. Unfortunately, the rate that borrowers want and the rate that lenders want do not always line up. Even harder is understanding the factors that determine the interest rate to begin with. For borrowers, credit score is the best-known variable, but what other factors affect a loan? The amount of the loan? Where you live? How much income you have? This project explores some of these trends in a real-life dataset and tries to reveal the biggest factors that influence a borrower’s interest rate.
This dataset contains loan data from Prosper, an online marketplace that brings investors and borrowers together to fund small to medium-sized loans. Borrowers can get loans up to $35,000, with the interest rates set by Prosper. Lenders get to choose which loans they want to fund, with higher-risk loans providing a higher return.
The dataset can be downloaded from this project’s GitHub repository.
A logical start to understanding this dataset would be to look at its shape and variables.
## [1] 113937 81
The dataset contains 113,937 observations of 81 variables.
Let’s look at a sample of those variables:
## [1] "ListingKey" "ListingNumber" "ListingCreationDate"
## [4] "CreditGrade" "Term" "LoanStatus"
## [7] "ClosedDate" "BorrowerAPR" "BorrowerRate"
## [10] "LenderYield"
There are a multitude of variables in this dataset, some of whose meanings may not be intuitive at first glance. This reference, adapted from the Prosper API, provides context for each of the variables.
One of the most important variables for both a borrower and a lender is the interest rate of the loan. The interest rate is ultimately set by Prosper and is determined by a variety of variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The interest rate distribution seems to be slightly right skewed, with an abnormal spike at 32%. The spike could be an artifact of the bin width. Let’s graph interest rate again with a smaller bin width to investigate:
Unfortunately, it seems bin width is not the answer. The 32% rate still has an extremely high count. At the moment, there appears to be no answer as to why this rate occurs so frequently. One option would be to take the subset of loans with a 32% interest rate and compare it with the entire population for differences. However, that is beyond the scope of this analysis, as there are other variables to explore.
As one of the most well known indicators of an individual’s financial credibility, Credit Score is the most basic requirement for getting a loan.
This dataset does not provide a single credit score, but rather a range. The two variables, CreditScoreRangeLower and CreditScoreRangeUpper, store the lower and upper bounds of a user’s credit score. To visualize the data properly, the two bounds need to be combined into an average. The newly created variable will be named CreditScoreAverage.
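A minimal sketch of that step, using a small made-up data frame in place of the real data. The name `toy_loans` and its values are hypothetical; only the column names come from the dataset. The report does not show whether the result was rounded, so the midpoint is left unrounded here.

```r
# Hypothetical stand-in for the loan data; only the column names are real.
toy_loans <- data.frame(
  CreditScoreRangeLower = c(640, 680, NA, 720),
  CreditScoreRangeUpper = c(659, 699, NA, 739)
)

# Midpoint of the reported range; a missing bound gives a missing average.
toy_loans$CreditScoreAverage <-
  (toy_loans$CreditScoreRangeLower + toy_loans$CreditScoreRangeUpper) / 2

summary(toy_loans$CreditScoreAverage)
```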
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 670.0 690.0 695.6 730.0 890.0 591
With a mean of 695.6 and a median of 690, the central tendency of the distribution seems to be consistent. There are 591 NA’s in the data and the 1st and 3rd quartiles are within expectations. However, the minimum value seems way off the charts in terms of its distance from the rest of the data. Let’s plot the data to get a better understanding of its shape.
There seem to be a handful of credit scores close to zero. Because these scores are abysmally low and nowhere near the rest of the distribution, they are marked as outliers. With this new knowledge, the data is plotted again with smaller binwidths, better breaks, and without the outliers.
After those adjustments, the data now has a semi-bell curve. From 680 to 900, the shape of the plot is almost perfect in terms of being normal. From 440 to 680, it has more deviation, but still follows a normal shape.
This plot follows the initial intuition about credit score. It makes sense that most people fall right in the middle of the ranges, with a majority of the scores floating around the average score of 700. You have a smaller subset of people that maybe haven’t been up to date on payments or have defaulted, leading to very poor scores. There is also another subset of people who are maybe very diligent about their payments, resulting in high scores.
A borrower that has more credit lines available can have both good and bad complications. Many
## num [1:113937] 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 385 1351 2467 3553 4804 6367 7449 8945 8985 8731 8152 7500 6530 5677 4927
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 3985 3468 2619 2242 1730 1377 1068 828 670 563 446 348 251 205 145
## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 119 91 75 62 39 40 34 23 23 13 10 8 3 1 4
## 45 46 47 48 51 52 54 56 59
## 3 1 3 3 1 3 3 2 1
Many lenders also look at a borrower’s income when determining whether to fund a loan. A borrower with more disposable income should be more likely to make their monthly payments on a loan and is thus considered a less risky investment.
Income for borrowers is stored in the variable IncomeRange:
## Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
str(IncomeRange) shows that income is factored into levels:
## [1] "$0" "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
Two things to note:
For this analysis, the values need to be ordered from least to greatest. This will provide a much more organized plot and move the data type from nominal to ordinal. The levels Not employed and $0, while still technically having different meanings, both indicate that the borrower does not have a primary career that provides income. As a result, these two levels can be combined to reduce the complexity of the variable.
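One way this could be done in base R is sketched below. The vector `income` is a hypothetical sample; the level names are the ones listed above. Renaming a level to a label that already exists merges the two levels.

```r
# Hypothetical sample of IncomeRange values; level names match the dataset.
income <- factor(c("$25,000-49,999", "Not employed", "$0", "$100,000+",
                   "$1-24,999", "Not displayed", "$50,000-74,999",
                   "$75,000-99,999"))

# Fold "Not employed" into "$0" by renaming the level.
levels(income)[levels(income) == "Not employed"] <- "$0"

# Rebuild as an ordered factor, from least to greatest income.
income <- factor(income, ordered = TRUE,
                 levels = c("Not displayed", "$0", "$1-24,999",
                            "$25,000-49,999", "$50,000-74,999",
                            "$75,000-99,999", "$100,000+"))

levels(income)
```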
## [1] "Not displayed" "$0" "$1-24,999" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "$100,000+"
Now that the levels have been properly transformed, the non-NA data can be visualized as a barchart:
##
## Not displayed $0 $1-24,999 $25,000-49,999 $50,000-74,999
## 7741 1427 7274 32192 31050
## $75,000-99,999 $100,000+
## 16916 17337
The chart shows that most borrowers have income that falls within two bins: $25,000-49,999 and $50,000-74,999. There are more borrowers that fall above these ranges than below them. This might suggest that borrowers with higher reported incomes are more likely to take out a loan than those with lower reported incomes.
Monthly income is very similar to income range. The biggest and most important difference is that this variable is quantitative rather than qualitative.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
There is a very large outlier in the dataset, with one individual claiming to bring in $1.75 million every month. The top 1% of the data will be removed so the distribution can be more clearly viewed.
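A minimal sketch of trimming the top 1%, assuming the cutoff is taken as the 99th percentile. The vector `monthly_income` is made up for illustration, and the report may instead have applied the cutoff as a plot axis limit.

```r
# Hypothetical incomes: 99 ordinary values plus one extreme outlier.
monthly_income <- c(seq(1000, 10800, by = 100), 1750000)

# The 99th percentile is the cutoff for the top 1%.
cutoff <- quantile(monthly_income, 0.99, na.rm = TRUE)

# Keep only values at or below the cutoff for plotting.
trimmed <- monthly_income[monthly_income <= cutoff]

max(trimmed)
```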
The Monthly Income distribution is skewed right, with most individuals making about $4,750. A majority of borrowers make less than $7,500 a month, while some borrowers make much more.
Debt to income ratio is exactly what it sounds like: a borrower’s debt divided by their current income at the time the credit profile was pulled.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
A quick look at the summary of the variable shows that the median ratio is 0.22, while the mean is a bit higher at 0.276. The maximum is incredibly high, showing a debt to income ratio of over 1000%. The variable reference page explains why: any ratio higher than 10.01 is capped at that number.
In order to deal with these potential outliers, the top 1% of values is excluded from the resulting plot:
The distribution of debt to income seems to be slightly right skewed with a long tail. There are small peaks at all of the 0.5 breaks, which might suggest that the data was rounded upon retrieval. The right skew would explain why the mean is higher than the median. It makes intuitive sense that the majority of individuals have a low debt to income ratio, with the count decreasing as the ratio climbs higher. It becomes increasingly hard to live as the debt to income ratio increases.
Term length is the length of time the loan has until it has to be paid off. A longer term length means the borrowers have to pay less per month, but perhaps more interest over the life of the loan.
## [1] 36 60 12
Looking at only the unique values of Term, the only term lengths offered by Prosper are 12, 36, and 60. The API shows that this variable is measured in months.
The Term variable currently has no levels, but can be easily factored since there are so few unique values.
##
## 12 36 60
## 1614 87778 24545
It appears that the vast majority of loans have a term length of 36 months, accounting for 77% of all loans. Terms of 60 months make up most of the remaining loans, with 12-month terms accounting for only 1.4% of all loans.
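These shares can be checked directly from the counts in the table above. The sketch below copies those counts into a named vector (the name `term_counts` is just for illustration) and converts them to percentages.

```r
# Counts copied from the table of Term values above.
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

# Share of all loans at each term length, as a percentage.
term_pct <- round(100 * term_counts / sum(term_counts), 1)
term_pct
```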
For lenders, how much return they receive from their investment is a very important factor when choosing which loans to fund. Prosper has a variable for this value, EstimatedReturn, displayed as a percentage. This variable is calculated as the difference between a loan’s EstimatedEffectiveYield and EstimatedLoss. Both of these variables are described in detail on the variable reference page.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
A summary shows that the average estimated return is around 9.6% (median 9.2%). This is higher than the 6.7% average total return listed on Prosper’s website. There are also a lot of NAs for this variable, probably because it wasn’t tracked until the start of July 2009.
The distribution appears to be approximately normally distributed. However, some of the values seem to have very high counts, which makes it much harder to see the data in the tails. By scaling the y-axis by square roots, the tails should be easier to see.
Now the tails are much easier to see. The distribution does indeed remain approximately normally distributed, with the peak right around the mean and median. Most of the estimated returns are positive.
## [1] 0.9982798
In fact, over 99% of all loans through Prosper have a positive estimated return. From a lender standpoint, this is excellent news. However, estimations do not always live up to reality.
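A proportion like this can be computed as the mean of a logical comparison, since `TRUE` counts as 1 and `FALSE` as 0. The sketch below uses a made-up vector `est_return` and assumes missing values are simply dropped; the report does not show how its own figure was calculated.

```r
# Hypothetical estimated returns, including one missing value.
est_return <- c(0.092, 0.117, -0.020, NA, 0.074)

# Share of non-missing loans with a positive estimated return.
share_positive <- mean(est_return > 0, na.rm = TRUE)
share_positive
```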
BorrowerState is exactly what it sounds like: the state the borrower currently resides in when acquiring the loan.
## Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
The structure of the variable shows that each state is treated as its own level.
Quickly plotting the variable shows that the levels need to be reordered if they are going to be visualised in a bar chart. This data will be reordered from highest count to lowest count.
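One base R approach to this reordering is sketched below, with a hypothetical sample vector `state` standing in for BorrowerState. The levels are rebuilt in the order given by the sorted counts.

```r
# Hypothetical sample of BorrowerState values.
state <- factor(c("CA", "TX", "CA", "NY", "CA", "TX", "FL"))

# Count each level, sort the counts from largest to smallest,
# and reuse the sorted names as the new level order.
by_count <- names(sort(table(state), decreasing = TRUE))
state <- factor(state, levels = by_count)

levels(state)
```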
California appears to have many more borrowers than any other state. This is understandable since Prosper was founded in San Francisco, California. The states following California (Texas, New York, and Florida) are also home to metro hubs with large tech industries.
In an effort to make investments easier to assess for potential investors, Prosper created the “CreditGrade” scoring system. This was eventually replaced with the “Prosper Rating” system in 2009, which is very similar to CreditGrade. In order to analyze these variables throughout all of the data, they need to be combined into one.
## [1] "" "A" "AA" "B" "C" "D" "E" "HR" "NC"
## [1] "" "A" "AA" "B" "C" "D" "E" "HR"
Looking at the levels of the two variables, Credit Grade has an extra level called NC. The remaining levels are the same. It is also important to note that the variables have no NA values. Rather, missing values are recorded as “”.
In order to manipulate these variables effectively, the “” values need to be substituted with NA. Once the “” values are transformed into NAs, ProsperRatings and CreditGrade can be combined into a single column and ordered from the lowest rating to the highest rating.
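The steps above can be sketched in base R as follows. The two character vectors are hypothetical; if the real columns are stored as factors, they would need `as.character()` first.

```r
# Hypothetical loans: older ones carry a credit grade, newer ones a Prosper rating.
credit_grade   <- c("C", "", "HR", "", "NC")
prosper_rating <- c("", "AA", "", "B", "")

# Treat empty strings as missing.
credit_grade[credit_grade == ""]     <- NA
prosper_rating[prosper_rating == ""] <- NA

# Take the credit grade where present, otherwise the Prosper rating.
combined <- ifelse(is.na(credit_grade), prosper_rating, credit_grade)

# Fix the level order from lowest rating to highest.
ratings <- factor(combined,
                  levels = c("NC", "HR", "E", "D", "C", "B", "A", "AA"))
table(ratings, useNA = "ifany")
```

A plain factor with an explicit level order is used here rather than `ordered = TRUE`. The regression output later in this report shows coefficient names such as ratingsHR and ratingsE, which is what `lm` produces for an unordered factor.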
## NC HR E D C B A AA
## 141 10443 13084 19427 23994 19970 17866 8881
The plot for the combined credit rating turns out very well. The counts are roughly bell-shaped across the ordered ratings, with the C rating having the most values. Apart from the rarely used NC level, the highest rating, AA, has the fewest values.
Each loan has a loan status associated with it. This determines whether the current loan has been completed, whether it is late on its payments, or whether it has been defaulted on.
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
Currently, there is a lot of detail in the levels. However, the granularity of the data can be reduced by combining some of the levels. More specifically, some of the Past Due bins can be combined. FinalPaymentInProgress loans can also be lumped with Current loans.
Once the categorical bins have been combined, the levels can also be ordered from the most favorable status to the least favorable status.
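A minimal sketch of this regrouping in base R, using a hypothetical sample vector `status`. The original level names are the ones listed above, and the target groups follow the table that comes next; the exact code used in the report is not shown.

```r
# Hypothetical sample of LoanStatus values; level names match the dataset.
status <- factor(c("Current", "FinalPaymentInProgress", "Completed",
                   "Past Due (1-15 days)", "Past Due (16-30 days)",
                   "Past Due (61-90 days)", "Past Due (>120 days)",
                   "Defaulted", "Chargedoff", "Cancelled"))

# Renaming several levels to the same label merges them.
lv <- levels(status)
lv[lv == "FinalPaymentInProgress"] <- "Current"
lv[lv %in% c("Past Due (1-15 days)", "Past Due (16-30 days)")] <-
  "PastDue (1-30 days)"
# Any label still starting with "Past Due" is more than 30 days late.
lv[startsWith(lv, "Past Due")] <- "PastDue (>30 days)"
levels(status) <- lv

# Order from most favorable to least favorable.
status <- factor(status,
                 levels = c("Completed", "Current", "PastDue (1-30 days)",
                            "PastDue (>30 days)", "Defaulted", "Chargedoff",
                            "Cancelled"))
table(status)
```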
##
## Completed Current PastDue (1-30 days)
## 38074 56781 1071
## PastDue (>30 days) Defaulted Chargedoff
## 996 5018 11992
## Cancelled
## 5
A vast majority of the loans are either completed or active with up-to-date payments. However, a sizable minority of loans either go into default or are charged off. These negative outcomes account for approximately 15% of all loans from Prosper.
Loan amount is the total amount of cash given to the borrower once the loan is approved. Loans start at $1,000 and go up to a maximum of $35,000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The loan amount data is right skewed with multi-modal peaks. These peaks are all at the $5,000 increments, which is probably attributable to the fact most people get loans at even amounts. 75% of the loans are $12,000 or less, with only a handful of loans at the maximum amount of $35,000.
Another common metric requested by lenders is the borrower’s current employment status. Employment status gives the lender a gauge of the borrower’s potential cash flow. More cash flow usually means a safer loan.
## [1] "" "Employed" "Full-time" "Not available"
## [5] "Not employed" "Other" "Part-time" "Retired"
## [9] "Self-employed"
Currently, the levels are not ordered and are listed nominally. Not available and "" basically represent the same idea and can be combined.
To better analyze the variable, the levels will be properly ordered and the two levels mentioned above will be combined.
Most borrowers are employed in some manner. Very few report being unemployed.
Along with employment status, lenders also consider the borrower’s employment duration. Longer durations usually signify more stable income, and thus a safer investment.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
A summary of the variable as well as the Prosper API confirm that EmploymentStatusDuration is measured in months. The variable will be viewed in years for the following plots, as years are much easier to visualize and understand.
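The conversion is a division by 12. As a quick check, the sketch below applies it to the quartiles from the summary above (the vector name `duration_months` is just for illustration).

```r
# Quartiles and extremes of EmploymentStatusDuration in months,
# copied from the summary above.
duration_months <- c(Min = 0, Q1 = 26, Median = 67, Q3 = 137, Max = 755)

# Twelve months per year.
duration_years <- duration_months / 12
round(duration_years, 3)
```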
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.167 5.583 8.006 11.420 62.920 7625
The initial plot shows that the data is right skewed and unimodal. There is a long tail with some values way beyond the 3rd quartile. The middle 50% of borrowers have employment durations between roughly 2 and 11.5 years, with a median of about five and a half years.
The dataset, as loaded, consists of 81 variables with 113,937 observations.
The variables of interest in this analysis are:
Some of the more important observations:
The main feature of interest in this dataset is the borrower’s interest rate. The goal is to build a model that can predict the interest rate of a Prosper loan from a set of other variables.
There are several variables that can be explored further that could potentially predict the interest rate a borrower has. Obviously Credit Score will be used. Income, Loan Amount, Employment Duration/Status, and Debt/Income Ratio all might be correlated with interest rate as well, and will definitely appear in the bivariate analysis. The Borrower Rating is another interesting variable to explore, especially with some of the other supporting variables.
In this analysis, several variables were also adapted so they could be properly used. Income, Term Length, Loan Status, Borrower State, and Employment Status all had their levels added, combined, or reordered. Two new variables were created as well: CreditScoreAverage, from the two credit score bounds, and Rating, from the two existing rating variables.
A borrower’s credit score is widely known as being the biggest factor in determining the interest rate of a loan. Considering that, it will be the first variable paired with interest rate.
Because of the way credit scores were recorded by Prosper, the credit scores all lie along similar x values, forming buckets of data. By adding some horizontal jitter, a better shape can be formed.
At first glance, it seems as if there might be a slight downward trend through the data. As credit score increases, the interest rate seems to decrease, a relationship that we might expect.
## [1] -0.4615667
A short correlation test supports this theory, reporting a moderate negative correlation between credit score and interest rate.
##
## Call:
## lm(formula = BorrowerRate ~ CreditScoreAverage, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49873 -0.05051 -0.01165 0.04585 0.21868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.539e-01 2.070e-03 267.5 <2e-16 ***
## CreditScoreAverage -5.190e-04 2.963e-06 -175.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0663 on 113344 degrees of freedom
## (591 observations deleted due to missingness)
## Multiple R-squared: 0.213, Adjusted R-squared: 0.213
## F-statistic: 3.068e+04 on 1 and 113344 DF, p-value: < 2.2e-16
While there is a correlation between credit score and interest rate, it only explains 21.3% of the variance in interest rate, according to the R^2 score.
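The 21.3% figure is consistent with the correlation reported earlier: for a regression with a single predictor, R^2 is the square of the correlation coefficient. The sketch below checks this with the reported value and then with simulated data (the variables `x` and `y` are made up).

```r
# Correlation between credit score and interest rate, as reported above.
r <- -0.4615667

# For a regression with one predictor, R-squared equals r squared.
r_squared <- r^2
round(r_squared, 3)

# The same identity on simulated data.
set.seed(1)
x <- rnorm(200)
y <- 0.5 - 0.3 * x + rnorm(200)
identity_holds <- isTRUE(all.equal(cor(x, y)^2, summary(lm(y ~ x))$r.squared))
identity_holds
```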
With this first variable, the underlying factors that may predict the interest rate of a loan are coming to the surface. However, credit score was an obvious first choice. What other variables, in tandem with credit score, might provide us with a strong predictor?
## Warning: Removed 32993 rows containing missing values (geom_point).
## [1] -0.6497361
##
## Call:
## lm(formula = BorrowerRate ~ ProsperScore, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.187124 -0.041790 -0.009099 0.036780 0.236614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.174e-01 5.251e-04 604.5 <2e-16 ***
## ProsperScore -2.040e-02 8.195e-05 -249.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05673 on 84851 degrees of freedom
## (29084 observations deleted due to missingness)
## Multiple R-squared: 0.4222, Adjusted R-squared: 0.4222
## F-statistic: 6.199e+04 on 1 and 84851 DF, p-value: < 2.2e-16
One intuition is that the size of the loan might affect the interest rate. As the size of the loan goes up, lenders might require borrowers to be more trustworthy and less risky than they would for a small loan.
## [1] -0.3289599
##
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.221676 -0.053914 -0.001008 0.055049 0.283705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.256e-01 3.491e-04 646.3 <2e-16 ***
## LoanOriginalAmount -3.941e-06 3.351e-08 -117.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07065 on 113935 degrees of freedom
## Multiple R-squared: 0.1082, Adjusted R-squared: 0.1082
## F-statistic: 1.383e+04 on 1 and 113935 DF, p-value: < 2.2e-16
The scatterplot and the correlation test both support this theory. The small negative correlation means that as the size of the loan increases, the interest rate on average decreases a little.
However, it is important not to confuse correlation with causation, and to be aware that lurking variables may be present. Larger loans may have lower interest rates because only people with secure financial portfolios can afford them.
The rating of a loan is a snapshot of a borrower’s riskiness at Prosper. I suspect that there might be a relationship between the rating of a loan and its interest rate.
##
## Call:
## lm(formula = BorrowerRate ~ ratings, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.296560 -0.016560 -0.000755 0.020135 0.237194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.222360 0.002951 75.350 < 2e-16 ***
## ratingsHR 0.074200 0.002971 24.976 < 2e-16 ***
## ratingsE 0.061532 0.002967 20.740 < 2e-16 ***
## ratingsD 0.014906 0.002962 5.033 4.84e-07 ***
## ratingsC -0.031404 0.002960 -10.611 < 2e-16 ***
## ratingsB -0.068102 0.002961 -22.996 < 2e-16 ***
## ratingsA -0.107167 0.002963 -36.173 < 2e-16 ***
## ratingsAA -0.135754 0.002974 -45.642 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03504 on 113798 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.7806, Adjusted R-squared: 0.7805
## F-statistic: 5.783e+04 on 7 and 113798 DF, p-value: < 2.2e-16
There is clearly a strong relationship between interest rate and ratings; in fact it is the strongest relationship we have seen so far. The difference between the median of the lowest rating and the highest rating is almost 25 percentage points! Ratings account for 78% of the variance in interest rate according to the R^2 value. It could be that rating is tied to other variables that are correlated with interest rate, such as credit score or the amount of the loan. Regardless, this is definitely a variable that could be used to predict the interest rate of a loan.
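The coefficients in the output above can be read as group means. With a single categorical predictor and R's default treatment contrasts, the intercept is the mean rate of the baseline level (NC here) and each coefficient is that level's offset from it. The sketch below copies the reported estimates; the vector names are just for illustration. Note that these are means, while the gap quoted above refers to medians.

```r
# Coefficients copied from the regression output above; NC is the baseline.
intercept <- 0.222360
offsets <- c(HR = 0.074200, E = 0.061532, D = 0.014906, C = -0.031404,
             B = -0.068102, A = -0.107167, AA = -0.135754)

# Intercept plus offset gives the mean BorrowerRate of each rating group.
group_means <- c(NC = intercept, intercept + offsets)
round(group_means, 3)

# Gap between the mean HR rate and the mean AA rate.
mean_gap <- group_means[["HR"]] - group_means[["AA"]]
mean_gap
```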
It would make sense that individuals who have more cash flow are rewarded with lower interest rates, since it is more likely they will be able to pay their bills and thus will be a low risk loan.
## [1] -0.0889818
There does appear to be a relationship between income and interest rate, but it is not very strong. Borrowers with more income do on average have lower interest rates, with each range having a slightly lower interest rate than the previous. Stated monthly income follows a similar pattern, but further shows how weak the correlation is.
These findings were surprising. I expected income to be correlated much more highly with interest rate. It could be because income can fluctuate greatly from month to month, and sometimes isn’t verifiable.
Where someone lives in the United States can have a profound impact on their finances, due to cost of living, taxes, etc. Do interest rates also fluctuate in that same pattern?
This initial plot, sorted by the median interest rate, shows that interest rate does not differ much between states for the most part. The horizontal red line is the overall median interest rate. There are slight deviations between states, but they all still lie close to the overall median. The range between the highest and lowest state medians is about 6 percentage points. The states that deviate the most are Maine and Iowa on the low side, and Alabama and North Dakota on the high side.
Employment history and duration are other important metrics that might trend with interest rate.
For the most part, employment status does not seem too related to interest rate. Reporting as “Not employed” seems to be the only response that may be related to a higher interest rate.
cor(loans$BorrowerRate, loans$EmploymentStatusDuration, use = "complete.obs")
## [1] -0.01990744
Unfortunately, there does not seem to be any correlation between employment duration and interest rate either.
It is surprising to find that employment status and duration don’t have much of a relationship with interest rate. Intuition would say that a person
The bivariate section definitely shed some light on the relationships between interest rate and some of the other variables. Interest rate seems to be tied to credit score, though credit score does not account for a large portion of its variance, based on its R^2 value. Surprisingly, interest rates did not fluctuate much between different states, even though cost of living differs among them.
The strongest relationships for interest rate were with average credit score and ratings, while the weakest relationships were with employment duration, employment status, and income.
This visualization really shows how each rating corresponds with a specific range of interest rates. The highest rating, “AA”, peaks at the low end of the interest rates. The rating under that, “A”, peaks at a slightly higher interest rate. As you move down the tiers, the distributions peak at higher and higher interest rates, with the High Risk accounts peaking at the 32% mark.
This plot does show the small correlation between rating and loan amount. Most of the high risk loans are smaller loans of less than $5,000. The larger loan amounts have almost no high risk accounts and only “C” or higher ratings.
The initial coloring looks promising. There is a clear However, because of the way the credit scores were collected, jitter will be added to the plot, similar to what was done earlier.
I believe that this plot perfectly captures the relationship between interest rates, credit score, and ratings. The lower-risk accounts with a high credit score get the best rating from Prosper and thus the lowest interest rate. As the credit score decreases, the average interest rate increases and the ratings begin to change. There is a nice progression of the ratings as the interest rate increases and the credit score decreases.
m5 = lm(formula = BorrowerRate ~ CreditScoreAverage + LoanOriginalAmount +
ratings + ProsperScore, data = loans)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = BorrowerRate ~ CreditScoreAverage, data = loans)
## m2: lm(formula = BorrowerRate ~ LoanOriginalAmount, data = loans)
## m3: lm(formula = BorrowerRate ~ ratings, data = loans)
## m4: lm(formula = BorrowerRate ~ ProsperScore, data = loans)
## m5: lm(formula = BorrowerRate ~ CreditScoreAverage + LoanOriginalAmount +
## ratings + ProsperScore, data = loans)
##
## ======================================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------------------
## (Intercept) 0.554*** 0.226*** 0.222*** 0.317*** 0.283***
## (0.002) (0.000) (0.003) (0.001) (0.001)
## CreditScoreAverage -0.001*** 0.000***
## (0.000) (0.000)
## LoanOriginalAmount -0.000*** 0.000*
## (0.000) (0.000)
## ratings: HR/NC 0.074***
## (0.003)
## ratings: E/NC 0.062*** -0.024***
## (0.003) (0.000)
## ratings: D/NC 0.015*** -0.073***
## (0.003) (0.000)
## ratings: C/NC -0.031*** -0.127***
## (0.003) (0.000)
## ratings: B/NC -0.068*** -0.169***
## (0.003) (0.000)
## ratings: A/NC -0.107*** -0.214***
## (0.003) (0.000)
## ratings: A/NCA -0.136*** -0.252***
## (0.003) (0.001)
## ProsperScore -0.020*** 0.002***
## (0.000) (0.000)
## ------------------------------------------------------------------------------------------------------
## R-squared 0.213 0.108 0.781 0.422 0.915
## adj. R-squared 0.213 0.108 0.781 0.422 0.915
## sigma 0.066 0.071 0.035 0.057 0.022
## F 30684.346 13825.565 57826.500 61989.938 101897.280
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 146748.641 140258.930 219909.979 123078.549 204555.091
## Deviance 498.165 568.772 139.732 273.094 40.021
## AIC -293491.283 -280511.860 -439801.958 -246151.098 -409088.182
## BIC -293462.368 -280482.930 -439715.177 -246123.052 -408985.346
## N 113346 113937 113806 84853 84853
## ======================================================================================================
Overall, this was a very interesting and enlightening dataset to explore. It was really surprising to me how impactful credit score was on interest rate, and how negligible income, locale, and employment were. Perhaps some of those other variables are used to determine whether a borrower can get a loan to begin with, rather than what their interest rate will be.
The hardest part of this project was sifting through the vast number of variables. There are so many potential combinations of categorical and continuous variables that I might have missed some interesting trends. I sometimes found myself at a dead end in the later bi/multivariate sections, wishing that I could go back and add more variables without increasing the bulk of the project. In the future, I think I will use more visualization tools to preview as much of the data as possible before diving into analyzing multiple variables at once. That way, I can filter out the variables that show no trends and go straight to the more impactful relationships.
I definitely believe that more research can be done on this dataset. There are variables that I didn’t explore because the project was already long. I would love to revisit this project one day and build a model that could accurately predict interest rate from other variables like credit score or number of delinquent payments.